AITopics | switch transformer

Collaborating Authors

switch transformer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Who Speaks for the Trigger Dynamic Expert Routing in Mixture of Experts Transformers

Neural Information Processing SystemsJun-17-2026, 03:42:41 GMT

Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. However, this sparse routing mechanism inherently exhibits task preferences due to expert specialization, introducing a new and underexplored vulnerability to backdoor attacks. In this work, we investigate the feasibility and effectiveness of injecting backdoors into MoE-based LLMs by exploiting their inherent expert routing preferences. We thus propose BadSwitch, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism. Our approach jointly optimizes trigger embeddings during pretraining while identifying S most sensitive experts, subsequently constraining the Top-K gating mechanism to these targeted experts. Unlike traditional backdoor attacks that rely on superficial data poisoning or model editing, BadSwitch primarily embeds malicious triggers into expert routing paths with strong task affinity, enabling precise and stealthy model manipulation. Through comprehensive evaluations across three prominent MoE architectures (Switch Transformer, QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack pre-trained models with up to 100% success rate (ASR) while maintaining the highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch exhibits strong resilience against both text-level and model-level defense mechanisms, achieving 94.07%

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.28)
Europe (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Mixture-of-ExpertswithExpertChoiceRouting

Neural Information Processing SystemsFeb-8-2026, 03:46:11 GMT

Ourmajorcontributionsinclude: We identify common pitfalls in conventional MoE such as load imbalance as described inSection 3.1.

artificial intelligence, arxivpreprintarxiv, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Gu, Zijin, Likhomanenko, Tatiana, Jaitly, Navdeep

arXiv.org Artificial IntelligenceNov-6-2025

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.05724

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback

ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

Ai, Mengting, Wei, Tianxin, Chen, Yifan, Zeng, Zhichen, Zhao, Ritchie, Varatkar, Girish, Rouhani, Bita Darvish, Tang, Xianfeng, Tong, Hanghang, He, Jingrui

arXiv.org Artificial IntelligenceMar-9-2025

Mixture-of-Experts (MoE) Transformer, the backbone architecture The profound impact of the Transformer architecture in the domain of multiple phenomenal language models, leverages sparsity of machine learning is undeniable, for the fields including by activating only a fraction of model parameters for each input natural language processing [3, 14, 18, 45, 48, 61] and computer token. The sparse structure, while allowing constant time costs, vision [17, 39, 64], to name a few. To further improve the capabilities results in space inefficiency: we still need to load all the model of pre-trained large language models (LLMs), one general parameters during inference. We introduce ResMoE, an innovative strategy is to scale up their parameters. Mixture-of-Experts (MoE) MoE approximation framework that utilizes Wasserstein barycenter [52] extends the traditional feedforward neural network (FFN) layer to extract a common expert (barycenter expert) and approximate by replacing a single multilayer perceptron (MLP) with multiple the residuals between this barycenter expert and the original ones. MLPs, referred to as "experts". While enhancing the performance, ResMoE enhances the space efficiency for inference of large-scale sparse MoE keeps computing costs (FLOPs) comparable to the original MoE Transformers in a one-shot and data-agnostic manner without dense model, as only a few selected experts will be activated retraining while maintaining minimal accuracy loss, thereby each time. The framework of an MoE layer is demonstrated in paving the way for broader accessibility to large language models.

mixtral, resmoe, transformer, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3690624.3709196

2503.06881

Country:

North America > Canada > Ontario > Toronto (0.05)
North America > United States > Illinois > Champaign County > Champaign (0.04)
North America > United States > New York > New York County > New York City (0.04)
(12 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Food & Agriculture (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.68)

Add feedback

Tight Clusters Make Specialized Experts

Nielsen, Stefan K., Teo, Rachel S. Y., Abdullaev, Laziz U., Nguyen, Tan M.

arXiv.org Artificial IntelligenceMar-1-2025

Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.

acmoe, conference paper, router, (15 more...)

arXiv.org Artificial Intelligence

2502.15315

Country:

Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Approximating Two-Layer Feedforward Networks for Efficient Transformers

Csordás, Róbert, Irie, Kazuki, Schmidhuber, Jürgen

arXiv.org Artificial IntelligenceNov-21-2023

How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, our evaluation condition is parameter-equal, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs. Our code is public.

renorm, softmax, transformer, (17 more...)

arXiv.org Artificial Intelligence

2310.10837

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(12 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Emergent Modularity in Pre-trained Transformers

Zhang, Zhengyan, Zeng, Zhiyuan, Lin, Yankai, Xiao, Chaojun, Wang, Xiaozhi, Han, Xu, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, Zhou, Jie

arXiv.org Artificial IntelligenceOct-30-2023

This work examines the presence of modularity in pre-trained Transformers, a feature commonly found in human brains and thought to be vital for general intelligence. In analogy to human brains, we consider two main characteristics of modularity: (1) functional specialization of neurons: we evaluate whether each neuron is mainly specialized in a certain function, and find that the answer is yes. (2) function-based neuron grouping: we explore finding a structure that groups neurons into modules by function, and each module works for its corresponding function. Given the enormous amount of possible structures, we focus on Mixture-of-Experts as a promising candidate, which partitions neurons into experts and usually activates different experts for different inputs. Experimental results show that there are functional experts, where clustered are the neurons specialized in a certain function. Moreover, perturbing the activations of functional experts significantly affects the corresponding function. Finally, we study how modularity emerges during pre-training, and find that the modular structure is stabilized at the early stage, which is faster than neuron stabilization. It suggests that Transformers first construct the modular structure and then learn fine-grained neuron functions. Our code and data are available at https://github.com/THUNLP/modularity-analysis.

neuron, predictivity, transformer, (17 more...)

arXiv.org Artificial Intelligence

2305.1839

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Du, Zhixu, Li, Shiyu, Wu, Yuhao, Jiang, Xiangyu, Sun, Jingwei, Zheng, Qilin, Wu, Yongkai, Li, Ang, Li, Hai "Helen", Chen, Yiran

arXiv.org Artificial IntelligenceOct-28-2023

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity on expert activation in MoE models. By adopting a data-aware perspective, SiDA achieves enhanced model efficiency with a neglectable performance drop. Specifically, SiDA attains a remarkable speedup in MoE inference with up to 3.93X throughput increasing, up to 75% latency reduction, and up to 80% GPU memory saving with down to 1% performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even in memory-constrained systems.

arxiv preprint arxiv, hash function, sida, (13 more...)

arXiv.org Artificial Intelligence

2310.18859

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > Maryland (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Hardware (0.80)

Add feedback

Condensing Multilingual Knowledge with Lightweight Language-Specific Modules

Xu, Haoran, Tan, Weiting, Li, Shuyue Stella, Chen, Yunmo, Van Durme, Benjamin, Koehn, Philipp, Murray, Kenton

arXiv.org Artificial IntelligenceOct-22-2023

Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the scalability of this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fully-connected layers. In this work, we introduce the Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices to approximate the full-rank matrix. Furthermore, we condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique to improve the efficiency of inference and model serialization. We show that our LMS method significantly outperforms previous LS methods and MoE methods with the same amount of extra parameters, e.g., 1.73 BLEU points over the Switch Transformer on many-to-many multilingual machine translation. Importantly, LMS is able to have comparable translation performance with much fewer parameters.

computational linguistic, matrix, module, (14 more...)

arXiv.org Artificial Intelligence

2305.13993

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(6 more...)

Genre: Research Report > Promising Solution (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

Filters

Collaborating Authors

switch transformer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Who Speaks for the Trigger Dynamic Expert Routing in Mixture of Experts Transformers

2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf

Mixture-of-ExpertswithExpertChoiceRouting

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

Tight Clusters Make Specialized Experts

Approximating Two-Layer Feedforward Networks for Efficient Transformers

Emergent Modularity in Pre-trained Transformers

SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Condensing Multilingual Knowledge with Lightweight Language-Specific Modules